Use helpers in your recordExtractor to make it easier to extract relevant content from your page.
Algolia has a selection of helpers:
- product
- article
- page
- splitContentIntoRecords
- codeSnippets
- docsearch
product
This helper extracts content from product pages.
A “product page” is an HTML page with one of these JSON-LD schema types:
recordExtractor: ({ url, $, helpers }) => {
  return helpers.product({ url, $ });
}
Response
The helper returns an object with the following properties:
- The product page’s URL (without parameters or hashes).
- The sku field of the JSON-LD schema.
- The description field of the JSON-LD schema.
- The image field of the JSON-LD schema.
- The product’s price, selected from one of these JSON-LD schema fields, in this order: offers.price, offers.highPrice, offers.lowPrice.
- The offers.priceCurrency field of the JSON-LD schema.
- The category field of the JSON-LD schema.
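The price fallback order can be sketched in plain JavaScript. This only illustrates the selection order described above and is not the helper's actual implementation. The function name pickPrice and the sample offers objects are hypothetical.

```javascript
// Hypothetical sketch: choose a price from a JSON-LD `offers` object,
// trying offers.price, then offers.highPrice, then offers.lowPrice.
function pickPrice(offers) {
  if (!offers) return undefined;
  const candidates = [offers.price, offers.highPrice, offers.lowPrice];
  // Return the first candidate that is actually present.
  return candidates.find((value) => value !== undefined && value !== null);
}

// A single offer exposes `price` directly.
const single = pickPrice({ price: 19.99, priceCurrency: 'USD' });

// An aggregate offer has no `price`, so `highPrice` is tried before `lowPrice`.
const aggregate = pickPrice({ lowPrice: 10, highPrice: 25, priceCurrency: 'USD' });
```

Under this order, an aggregate offer without a price field resolves to its high price rather than its low price.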
article
This helper extracts content from article pages.
An “article page” is an HTML page with an appropriate JSON-LD schema or meta tag:
One of these JSON-LD schema types:
recordExtractor: ({ url, $, helpers }) => {
  return helpers.article({ url, $ });
}
Response
The helper returns an object with the following properties:
- The article’s URL (without parameters or hashes).
- The article’s headline, selected from one of these, in this order: meta[property="og:title"], meta[name="twitter:title"], head > title, the first <h1>.
- The article’s description, selected from one of these, in this order: meta[name="description"], meta[property="og:description"], meta[name="twitter:description"].
- The keywords field of the JSON-LD schema. Article tags: meta[property="article:tag"].
- The image associated with the article, selected from one of these, in this order: meta[property="og:image"], meta[name="twitter:image"].
- The author field of the JSON-LD schema.
- The datePublished field of the JSON-LD schema.
- The dateModified field of the JSON-LD schema.
- The category field of the JSON-LD schema.
- The article’s content (body copy).
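As an illustration of the headline fallback order, here is a minimal sketch written against a Cheerio-style $ such as the one passed to recordExtractor. It is an assumption about how such a lookup could be written, not the helper's own code. The names HEADLINE_SELECTORS and pickHeadline are hypothetical.

```javascript
// Hypothetical sketch of the headline fallback order.
// Selectors are listed in the order in which they are tried.
const HEADLINE_SELECTORS = [
  'meta[property="og:title"]',
  'meta[name="twitter:title"]',
  'head > title',
  'h1',
];

function pickHeadline($) {
  for (const selector of HEADLINE_SELECTORS) {
    const el = $(selector).first();
    // Meta tags carry their value in `content`; other elements in their text.
    const value = selector.startsWith('meta') ? el.attr('content') : el.text();
    if (value && value.trim()) return value.trim();
  }
  // No candidate matched or all of them were empty.
  return undefined;
}
```

The same pattern would apply to the description and image lookups by swapping in their selector lists.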
page
This helper extracts text from pages, regardless of their type or category.
recordExtractor: ({ url, $, helpers }) => {
  return helpers.page({
    url,
    $,
    recordProps: {
      title: 'head title',
      content: 'body',
    },
  });
}
Response
The helper returns an object with the following properties:
- The object’s unique identifier.
- The URL hostname (for example, example.com).
- The URL path: everything after the hostname.
- The URL depth, based on the number of slashes after the domain. For example, http://example.com/ = 1, http://example.com/about = 1, http://example.com/about/ = 2.
- The page’s file type. One of: html, xml, json, pdf, doc, xls, ppt, odt, ods, odp, or email.
- The page length in bytes.
- The page title, derived from head > title.
- The page’s description, derived from meta[name="description"].
- The page’s keywords, derived from meta[name="keywords"].
- The image associated with the page, derived from meta[property="og:image"].
- The page’s section titles, derived from h1 and h2.
- The page’s content (body copy).
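The URL depth rule (counting slashes after the domain) can be reproduced with the standard URL class. This is a sketch of the rule as described, not the crawler's own code. The function name urlDepth is hypothetical.

```javascript
// Hypothetical sketch: count the slashes that follow the domain.
function urlDepth(url) {
  const { pathname } = new URL(url);
  // Splitting on "/" yields one more piece than there are slashes.
  return pathname.split('/').length - 1;
}

const depths = [
  'http://example.com/',
  'http://example.com/about',
  'http://example.com/about/',
].map(urlDepth);
// depths is [1, 1, 2], matching the examples in the list above.
```

Note that a trailing slash adds one level, which is why /about and /about/ differ.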
splitContentIntoRecords
This helper extracts text from long HTML pages and splits them into smaller chunks.
This can help prevent “Record too big” errors.
Using this example record extractor on a long page returns an array of records,
each one smaller than 1,000 bytes.
recordExtractor: ({ url, $, helpers }) => {
  const baseRecord = {
    url,
    title: $('head title').text().trim(),
  };
  const records = helpers.splitContentIntoRecords({
    baseRecord,
    $elements: $('body'),
    maxRecordBytes: 1000,
    textAttributeName: 'text',
    orderingAttributeName: 'part',
  });
  // Produced records can be modified after creation, if necessary.
  return records;
}
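To make the idea of splitting by size concrete, here is a minimal sketch that packs words into records until the serialized record would exceed a byte limit. It is not the helper's actual algorithm, only one plausible way to do it. The names byteSize and splitText are hypothetical, and the text and part attribute names simply mirror the example above.

```javascript
// Hypothetical sketch, not the helper's actual algorithm: pack words into
// records so that each serialized record stays within maxRecordBytes.
const byteSize = (record) =>
  new TextEncoder().encode(JSON.stringify(record)).length;

function splitText({ baseRecord, text, maxRecordBytes }) {
  const records = [];
  let current = '';
  // Store the accumulated text as a new record, numbered in page order.
  const flush = () => {
    if (current) records.push({ ...baseRecord, text: current, part: records.length });
    current = '';
  };
  for (const word of text.split(/\s+/).filter(Boolean)) {
    const candidate = current ? `${current} ${word}` : word;
    const size = byteSize({ ...baseRecord, text: candidate, part: records.length });
    if (size > maxRecordBytes && current) {
      flush();        // the current record is full
      current = word; // start the next record with this word
    } else {
      current = candidate;
    }
  }
  flush();
  // Caveat: a single word longer than the limit still yields an oversized record.
  return records;
}
```

Each record repeats the base record's attributes, so every chunk can be traced back to its page, and part preserves the original order of the chunks.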
When splitting pages, the same words can appear in several records that belong to the same page.
If you don’t want these duplicates to turn up when users search:
- Set distinct to true in your index. For example, distinct: true.
- Set attributeForDistinct to your page’s URL. For example, attributeForDistinct: 'url'.
- Set searchableAttributes to your page title and body content. For example, searchableAttributes: [ 'title', 'text' ].
- Add a customRanking to sort from the first split record on your page to the last. For example, customRanking: [ 'asc(part)' ].
initialIndexSettings: {
  'my-index': {
    distinct: true,
    attributeForDistinct: 'url',
    searchableAttributes: [ 'title', 'text' ],
    customRanking: [ 'asc(part)' ],
  }
}
Response
Specify one or more response parameters in your helper to determine what information is returned.
- baseRecord: takes this record’s attributes (and values) and adds them to all the split records.
- $elements (string, default: "$('body')")
- orderingAttributeName: this attribute stores the sequentially generated number assigned to each record when the helper splits a page.
- textAttributeName: name of the attribute in which to store the text of each split record.
codeSnippets
Use this helper to extract code snippets from pages.
The helper finds code snippets by looking for <pre> tags and extracting the content and the language class prefix from the tag.
If the crawler finds several code snippets on a page, the helper returns a list of those snippets.
recordExtractor: ({ url, $, helpers }) => {
  // `tag` and `languageClassPrefix` are placeholders: define them with your own values.
  const code = helpers.codeSnippets({ tag, languageClassPrefix });
  return { code };
}
Response
The helper returns an array of code objects with the following properties:
- The code snippet’s language (if found).
- The URL of the nearest sibling <a> tag.
- Text fragment URL with the code snippet. This is a selection of text within a page that’s linked to another page.
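The language lookup can be pictured as stripping a known prefix from one of the tag's class names. The sketch below assumes a class attribute such as "hljs language-js" and a prefix of "language-". Both values, and the function name languageFromClass, are illustrative assumptions rather than documented defaults.

```javascript
// Hypothetical sketch: derive a snippet's language from its class attribute.
function languageFromClass(classAttr, languageClassPrefix) {
  const classes = (classAttr || '').split(/\s+/).filter(Boolean);
  // Find the first class name that starts with the prefix.
  const match = classes.find((name) => name.startsWith(languageClassPrefix));
  // Strip the prefix to keep only the language name.
  return match ? match.slice(languageClassPrefix.length) : undefined;
}

const language = languageFromClass('hljs language-js', 'language-');
```

When no class carries the prefix, the function returns undefined, which corresponds to the "if found" qualifier on the language property.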
docsearch
This helper extracts content and formats it to be compatible with DocSearch.
It creates an optimized number of records for relevancy and hierarchy.
You can also use it without DocSearch or to index non-documentation content.
For more information, see the DocSearch documentation.
recordExtractor: ({ url, $, helpers }) => {
  return helpers.docsearch({
    aggregateContent: true,
    indexHeadings: true,
    recordVersion: 'v3',
    recordProps: {
      lvl0: {
        selectors: "header h1",
      },
      lvl1: "article h2",
      lvl2: "article h3",
      lvl3: "article h4",
      lvl4: "article h5",
      lvl5: "article h6",
      content: "main p, main li",
    },
  });
}